In this lab, we will introduce the basic functionality of R, together with some simple plotting functions. We will be using the following file for these examples: penguins.csv
More information about this dataset can be found here: https://allisonhorst.github.io/palmerpenguins/index.html
In this and subsequent labs, code that can be entered into R will be highlighted, e.g.:
plot(x, y)
And R output will be formatted with ## at the start of
the line. File names will be given in italics and will be
available in the ‘data’ directory on the course Canvas site.
The RStudio interface consists of several windows. Start RStudio from the ‘Start’ menu under Windows, and the main RStudio window should appear.
In the console window, you can type commands at the > prompt and R will then execute your command. This is the most important window, because this is where R actually does stuff.
Note that you can rearrange the order of these panels, so this may look different on other computers.
Much of your time spent with R will involve typing commands in at the console, and RStudio has some help with this.
For example, try typing pri and pressing ‘Tab’ - you should see print as part of the list, and you can click on this, or scroll down to use it from the list.
R has a workspace where variables and data are stored as you use it. This is held in the memory of the computer, so if you are working from a file, you will need to read it in to the R workspace, and then work on the values held in memory. This means that you only access files to read in or write out data; the rest of the time you are working on a copy in the workspace.
R defines the working directory as the folder in which it is currently working. When you ask R to open a certain file, it will look in the working directory for this file, and when you tell R to save a data file or plot, it will save it in the working directory. Once you have set up the folders described below, download all the files from the training session Google drive and move them to the data folder.
For this class, the labs will assume that you have your files organized according to the following structure:
+-- ugic2024
| +-- data
To do this, go to your Documents folder, and create a new folder called ugic2024. In this, now create a new folder called data (where we will store all the data used today).
Once you have created these folders, we need to change R’s working directory so that it is pointing to ugic2024. The easiest way to do this is by going to the [Session] menu in RStudio, then [Change working directory]. This will open a file browser that you can use to browse through your computer and find the folder. (If you are using the base version of R, go to [File] > [Change dir…] in Windows, or [Misc] > [Change Working Directory] on Mac.)
You can also change the working directory manually using the
setwd() function in the console. To do this, you may need
to know the full path to the folder on your computer. If you followed
the instructions given above, this should be:
C:/Users/username/Documents/ugic2024 (Windows)
/Users/username/Documents/ugic2024 (Mac)
where username is your user name on the computer. You can also find this path as follows:
- Use the File Explorer to select the folder ugic2024
- Copy the full path to ugic2024 so that it can be pasted into the setwd() command
Go to the console window in RStudio and enter the following code:
setwd("")
And paste your directory between the quotes. The code should look something like this (but with your actual user name):
setwd("C:/Users/username/Documents/ugic2024/")
Note that the slashes are forward slashes and don’t forget the quotations. R is case sensitive, so make sure you write capitals where necessary. To check that you have correctly changed directory, enter the following command, which will show you the current working directory:
getwd()
You can also use relative paths. If your current working directory is ugic2024 and you want to change to data, enter the following code (where ./ refers to the current directory, so ./data is the data folder one level below it).
setwd("./data")
If your current working directory is data and you want to change to ugic2024, enter the following code (where ../ refers to the parent directory, one level above the current one).
setwd("../")
Before proceeding with the rest of today’s lab, make sure to change
your working directory back to ugic2024.
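To see how ./ and ../ behave without touching your real folders, the following sketch builds a throwaway copy of the folder layout inside R's temporary directory and moves between the two levels. The folder name ugic2024_demo and the variable names are made up for this illustration.

```r
# Remember where we started so we can return at the end
old_wd <- getwd()

# Build a temporary folder with a data subfolder (hypothetical demo layout)
demo_dir <- file.path(tempdir(), "ugic2024_demo")
dir.create(file.path(demo_dir, "data"), recursive = TRUE, showWarnings = FALSE)

setwd(demo_dir)
setwd("./data")                 # ./ is the current folder, so this moves down into data
in_data <- basename(getwd())    # name of the folder we are now in

setwd("../")                    # ../ is the parent folder, so this moves back up
in_parent <- basename(getwd())

setwd(old_wd)                   # restore the original working directory
in_data
in_parent
```

Storing the original directory first and restoring it at the end means the sketch leaves your session where it found it.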
If this all seems a little foreign to you, don’t worry - there will be plenty of opportunities to practice this over the day. Understanding the directory structure is very important in being able to manage your files both for this training session and any analysis you will do later.
In the console, the ‘>’ is the prompt, and your
commands will be entered here. Click on the console window, then enter
the following:
2+2
## [1] 4
And press ‘Enter’, and R will tell you, not too surprisingly, that
2+2=4. The spacing is not important: you could equally enter
2 + 2 or 2+ 2 and get the same result. The
[1] before the output is a vector index. It refers to the
first value in the output vector (here a vector of length 1).
We’ll be using this later.
We can equally use standard math functions, for example, to take the natural log or square root of 2:
log(2)
## [1] 0.6931472
sqrt(2)
## [1] 1.414214
So far, these commands have run some calculations and displayed the results in the console. If you need to store any R output for further use, you will need to assign it to a variable. There are two assignment operators in R: <- and =. For simple assignments like these they are interchangeable, and you will see both used in R examples. To store the results of the previous commands:
a = log(2)
b = sqrt(2)
If you now look in the top right corner in the ‘Environment’ window, you should see these two variables appear. These are now held in R’s workspace and can be reused:
a + b
## [1] 2.107361
# File input and output
R can use many different file types, but comma separated value (csv)
files are most frequently used as the easiest way to transfer between R
and Excel. Make sure you have changed your working directory to the
ugic2024 folder. Then get a list of csv files in the
data folder as follows (note the use of the pattern
parameter to get only certain files):
list.files("./data/", pattern = "\\.csv$") # pattern is a regular expression: names ending in .csv
## [1] "penguin2.csv" "penguins.csv"
Let’s read in the data from the Penguin file (penguins.csv,
make sure this appeared in the list from the previous command and ask me
if you don’t see it). CSV files can be read in using the
read.csv() function:
penguin <- read.csv("./data/penguins.csv")
Note that because this file is held in a different folder
(data) to your current working directory
(ugic2024), you need to provide the relative path
(./data).
The first part of this code (penguin <-) tells R to
store the data read in from the file in a data frame called
penguin. To print out the contents of any object in R,
simply type the name of that object at the command prompt.
penguin
Other useful commands are class() to see what data class
an object is (a dataframe), and names() to get a list of
the column headers. The function str() is probably the most
useful, describing the column names and the type of data stored in
them.
class(penguin)
names(penguin)
str(penguin)
## 'data.frame': 344 obs. of 8 variables:
## $ species : chr "Adelie" "Adelie" "Adelie" "Adelie" ...
## $ island : chr "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
## $ bill_length_mm : num 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
## $ bill_depth_mm : num 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
## $ flipper_length_mm: int 181 186 195 NA 193 190 181 195 193 190 ...
## $ body_mass_g : int 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
## $ sex : chr "male" "female" "female" NA ...
## $ year : int 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
You can write out dataframes to csv files with
write.csv(). Here we’ll write the penguin data back out to
a new file (normally, we’d only do this after transforming or working
with the data, but this is just an example). Once you’ve run this, check
the data folder to make sure the new file has been
created.
write.csv(penguin, "./data/penguin2.csv", row.names = FALSE)
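If you want to convince yourself that writing and reading csv files is a faithful round trip, here is a minimal sketch using a small made-up data frame (not the real penguin data) and a temporary file, so nothing in your data folder is touched. The object names toy, toy2 and tmp_csv are illustrative only.

```r
# A small made-up data frame (not the real penguin data)
toy <- data.frame(species = c("Adelie", "Gentoo", "Chinstrap"),
                  body_mass_g = c(3750, 5000, NA))

# Write to a temporary csv file, leaving out the row names
tmp_csv <- tempfile(fileext = ".csv")
write.csv(toy, tmp_csv, row.names = FALSE)

# Read it back in and check that the structure survived
toy2 <- read.csv(tmp_csv)
str(toy2)
identical(names(toy), names(toy2))
```

Note that the missing value is written to the file as NA and is read back in as a missing value.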
You can also write out the entire contents of R’s workspace to a
binary file (*.RData), with the save function. This can be
useful if you want to save a snapshot of your current R session with all
the variables and objects you are working with. To save the current
workspace:
save(list = ls(), file = "dat.RData")
We can now clear the current workspace (check the environment window before and after you do this):
rm(list = ls())
And if we now reload the binary file, you should see that the
variables we had previously (penguin, a,
b) reappear in the environment.
load(file = "dat.RData")
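The same idea can be tried on a small scale. This sketch saves two variables to a temporary .RData file and loads them into a separate environment, so your current workspace is left alone; the names tmp_rdata and restored_env are assumptions made for the example.

```r
a <- log(2)
b <- sqrt(2)

# Save just these two objects to a temporary file
tmp_rdata <- tempfile(fileext = ".RData")
save(a, b, file = tmp_rdata)

# load() returns (invisibly) the names of the objects it restored;
# loading into a new environment avoids overwriting anything in the workspace
restored_env <- new.env()
restored <- load(tmp_rdata, envir = restored_env)
restored
restored_env$a + restored_env$b
```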
When we read in the contents of the file, it was stored as a dataframe with a variable name penguin - this represents the whole dataset. Normally however, you will need to work with subsets of the data (e.g. individual columns or sets of rows). There are two common ways to do this, one with base R and one using the tidyverse, that we will explore here. The tidyverse syntax is becoming more prevalent, especially for data science work, but there are still times when understanding the base R approach can be useful, e.g. when working with matrices or arrays.
If you want to use base R to access subsets of the data frame or individual values, you will need to understand how R indexes data. R uses 1-based indexing, which means that the first value in any set of data will be indexed at 1, the second at 2, etc. This is in contrast to some other languages (e.g. Python) that use 0-based indexing.
To show how this works, we’ll first create a vector of random numbers and print the content (note that your values will be different):
x <- rnorm(10)
x
## [1] 1.37095845 -0.56469817 0.36312841 0.63286260 0.40426832 -0.10612452
## [7] 1.51152200 -0.09465904 2.01842371 -0.06271410
To access the first value:
x[1]
## [1] 1.370958
Or to access the third:
x[3]
## [1] 0.3631284
You can also use the colon : to slice the data,
i.e. extract the data between two indices:
x[1:3]
## [1] 1.3709584 -0.5646982 0.3631284
Or you can use a set of irregular indices by concatenating them
together with the c() function:
x[c(3,5,7)]
## [1] 0.3631284 0.4042683 1.5115220
You can also use a negative index with a - symbol. This will extract all values except the one with that index:
x[-1]
## [1] -0.56469817 0.36312841 0.63286260 0.40426832 -0.10612452 1.51152200
## [7] -0.09465904 2.01842371 -0.06271410
Matrices in R are indexed by [row,col], so you need to provide both of these to extract subsets of data. Here we’ll make up a small matrix with values from 2 to 24 in steps of 2 (note that R fills a matrix column by column):
x <- matrix(seq(2, 24, by = 2), nrow = 3, ncol = 4)
To get the very first entry, we index it with [1,1]:
x[1,1]
## [1] 2
And to get the last:
x[3,4]
## [1] 24
If you only provide one index, then it will extract all values for that row or column:
x[1,] # first row
## [1] 2 8 14 20
x[,4] # fourth column
## [1] 20 22 24
As before, you can use : to extract slices. This will extract all values in the 2nd to 4th columns:
x[, 2:4]
## [,1] [,2] [,3]
## [1,] 8 14 20
## [2,] 10 16 22
## [3,] 12 18 24
Or the first row for the same columns:
x[1, 2:4]
## [1] 8 14 20
At this point, you may be wondering how all this relates to data frames - which is how the penguin data was stored when it was read in from the csv file earlier. Data frames are similar to 2D matrices as they are composed of rows (representing observations) and columns (representing variables). As a result, you can use the same row/column indexing to access subsets of data:
penguin[ ,4] # 4th column
penguin[10, ] # 10th row
And as before, you can access a range of rows and columns using the colon (:) operator:
penguin[ ,1:4] # Columns 1 to 4
penguin[1:10, ] # First 10 rows
penguin[1:50,1:2] # First 50 rows of the first two columns
Dataframes also use a $ notation, which allows you to access individual columns or variables. For example, to extract the bill length variable:
names(penguin)
## [1] "species" "island" "bill_length_mm"
## [4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
## [7] "sex" "year"
penguin$bill_length_mm # Extract single column
The advantage of this is that you can access a column, even if you don’t know the column number. You can also combine this with the vector indexing (see above) to access specific values within a vector:
penguin$bill_length_mm[3] # 3rd element
penguin$bill_length_mm[-3] # All but 3rd element
penguin$bill_length_mm[1:10] # First 10 elements
Comparison operators (<, <=, >, >=, ==, !=) can be used to select parts of the data set by value. This is very useful if you only want to analyze part of your dataset:
penguin$bill_length_mm[penguin$bill_length_mm > 40] # All bill lengths over 40 mm
penguin[penguin$bill_length_mm > 40, ] # All rows where bill length is over 40 mm
penguin$bill_length_mm[(penguin$species == 'Adelie')] # Bill lengths of all "Adelie" penguins
penguin[(penguin$species == 'Adelie'), ] # All rows for "Adelie" penguins
These operators can be combined, so to get all instances of “Adelie” species with bill lengths greater than 40:
penguin[(penguin$species == 'Adelie') & (penguin$bill_length_mm > 40), ]
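One thing to watch for with this kind of selection is missing data: a comparison with NA is itself NA, and indexing a data frame with NA gives a row of NAs. The sketch below shows this on a small made-up data frame (toy is not the penguin data), along with two common ways around it.

```r
toy <- data.frame(species = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
                  bill_length_mm = c(39.1, NA, 46.1, 50.0))

keep <- toy$bill_length_mm > 40
keep                       # the comparison with NA is itself NA

nrow(toy[keep, ])          # the NA produces an extra row of NAs
nrow(toy[which(keep), ])   # which() keeps only the TRUE positions

# Combining conditions, with the missing values excluded explicitly
long_gentoo <- toy[toy$species == "Gentoo" & !is.na(toy$bill_length_mm) &
                     toy$bill_length_mm > 40, ]
long_gentoo
```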
The main difference between data frames and matrices is that data frames can contain different data classes, whereas matrices and vectors can only contain a single class. Compare the matrix we made earlier:
class(x)
## [1] "matrix" "array"
class(x[, 1])
## [1] "numeric"
class(x[, 2])
## [1] "numeric"
to the contents of the data frame:
class(penguin)
## [1] "data.frame"
class(penguin$bill_length_mm)
## [1] "numeric"
class(penguin$species)
## [1] "character"
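To see why this matters, here is a short sketch of what happens when numeric and character values are forced into a matrix, compared with a data frame. The values are made up, and the example assumes R 4.0 or later, where character columns in a data frame are not converted to factors by default.

```r
# Binding a numeric and a character column into a matrix coerces everything to character
m <- cbind(c(39.1, 46.1), c("Adelie", "Gentoo"))
class(m[, 1])

# A data frame keeps each column's own class
d <- data.frame(bill_length_mm = c(39.1, 46.1), species = c("Adelie", "Gentoo"))
sapply(d, class)
```

Once the numbers in the matrix have become character strings, functions such as mean() will no longer work on them, which is why mixed data are normally kept in a data frame.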
The tidyverse is a set of add-on packages for R that are largely designed to streamline working with and visualizing data in R. One of these packages (dplyr) is used to transform and summarize tabular data with rows and columns, like R’s dataframes. It can be used as a replacement for much of the indexing described above. The package contains a set of functions (or “verbs”) that perform common data manipulation operations such as filtering for rows, selecting specific columns, re-ordering rows, adding new columns and summarizing data.
As this is an add-on package, you will need to download and install it. You can install the entire set of tidyverse packages with:
install.packages("tidyverse")
Or just dplyr with:
install.packages("dplyr")
Note that you will only need to install this once, but each time you
want to use it in a new R session, you will need to load it into memory
using the library function:
library(dplyr)
dplyr has a wide variety of operations, but the most important ones are:
| Verb | Description |
|---|---|
| select() | select columns |
| filter() | filter rows |
| arrange() | re-order or arrange rows |
| mutate() | create new columns |
| summarise() | summarize values |
| group_by() | allows for group operations |
These functions are commonly used with a pipe operator, which can be used to chain functions together. The pipe operator is written as %>%, and is loaded automatically with dplyr so you don’t need to worry about loading it (R 4.1 and later also include a built-in pipe, written |>). It takes the output from one function and pipes it directly to another function as its first argument. In the first example below, we’ll look at using one of these functions without and then with the pipe, and then use the pipe throughout the rest of this lab.
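As a minimal illustration of what a pipe does, the sketch below computes the same value with nested function calls and with a pipe. It uses the built-in |> pipe so that it runs without any packages, which assumes R 4.1 or later; the %>% pipe used in the rest of this lab reads the same way.

```r
vals <- c(4, 9, 16)

# Nested call: read from the inside out
nested <- sum(sqrt(vals))

# Piped call: read from left to right (the native |> pipe needs R 4.1 or later)
piped <- vals |> sqrt() |> sum()

nested
piped
```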
The select() function allows the selection of columns by
name. To use this without the pipe, add the data frame name in the
function parentheses. Here we select two columns from the
penguin dataframe, and then use the head function to show
the first few rows.
spp_yr <- select(penguin, species, year)
head(spp_yr) # Show the first 6 rows
## species year
## 1 Adelie 2007
## 2 Adelie 2007
## 3 Adelie 2007
## 4 Adelie 2007
## 5 Adelie 2007
## 6 Adelie 2007
We can rewrite this as a single line using the %>%
pipe and avoiding the need for the intermediate data frame
spp_yr. Note there are two pipes: the first sends the
dataframe to the select() function, and the second sends
the output of this function to the head() function:
penguin %>%
select(species, year) %>%
head()
## species year
## 1 Adelie 2007
## 2 Adelie 2007
## 3 Adelie 2007
## 4 Adelie 2007
## 5 Adelie 2007
## 6 Adelie 2007
For most of these examples, we won’t save the output, but you can easily store this by assigning it to a new variable:
spp_yr <- penguin %>%
select(species, year)
To select all the columns except a specific column, use the “-” (subtraction) operator:
penguin %>%
select(-year) %>%
head()
To select a range of columns by name, use the “:” (colon) operator (as we did with selecting multiple columns using indices):
penguin %>%
select(bill_length_mm:body_mass_g) %>%
head()
To select all columns that start with the character string “bill”, use the function starts_with():
penguin %>%
select(starts_with("bill"))
Here are some additional functions to select columns based on specific criteria:
- ends_with() = Select columns that end with a character string
- contains() = Select columns that contain a character string
- matches() = Select columns that match a regular expression
- one_of() = Select column names that are from a group of names
The filter() function allows the selection of rows. To filter the data for rows where the body mass is over 4000 grams:
penguin %>%
filter(body_mass_g > 4000)
You can add multiple conditions in the filter function. For example, to filter for male penguins with a body mass of over 4000 g:
penguin %>%
filter(sex == "male", body_mass_g > 4000)
To filter for male penguins with a body mass of over 4000 g that are from Biscoe island:
penguin %>%
filter(sex == "male", body_mass_g > 4000, island == "Biscoe")
Filter for female penguins from Biscoe and Torgersen islands:
penguin %>%
filter(sex == "female", island %in% c("Biscoe","Torgersen"))
Note that there is also a slice() function, which simply
extracts rows according to their position - this is the equivalent to
the indexing we have done before:
penguin %>%
slice(1:10)
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18.0 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42.0 20.2 190 4250
## sex year
## 1 male 2007
## 2 female 2007
## 3 female 2007
## 4 <NA> 2007
## 5 female 2007
## 6 male 2007
## 7 female 2007
## 8 male 2007
## 9 <NA> 2007
## 10 <NA> 2007
The real power of dplyr comes when you start to combine functions. For example here, we’ll extract just the body mass and species name for the penguins from Dream island:
penguin %>%
filter(island == "Dream") %>%
select(species, body_mass_g)
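If you would like to check your understanding of how the verbs chain together, it can help to run them on a data frame small enough to verify by eye. The sketch below uses a made-up four-row data frame (toy), not the penguin data, and assumes dplyr is installed.

```r
library(dplyr)

# Made-up values for illustration only
toy <- data.frame(species = c("Adelie", "Gentoo", "Chinstrap", "Adelie"),
                  island = c("Dream", "Biscoe", "Dream", "Torgersen"),
                  body_mass_g = c(3400, 5200, 3700, 3750))

dream_heavy <- toy %>%
  filter(island == "Dream", body_mass_g > 3500) %>%  # keep matching rows
  select(species, body_mass_g)                       # then keep two columns

dream_heavy
```

Only one row satisfies both conditions, so the result has a single row and the two selected columns.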
arrange() function
This function allows us to arrange (or re-order) rows by a particular column. So to arrange by increasing bill length (I’m using the head() function to limit the amount of data that appears on the screen):
penguin %>%
arrange(bill_length_mm) %>%
head()
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1 Adelie Dream 32.1 15.5 188 3050
## 2 Adelie Dream 33.1 16.1 178 2900
## 3 Adelie Torgersen 33.5 19.0 190 3600
## 4 Adelie Dream 34.0 17.1 185 3400
## 5 Adelie Torgersen 34.1 18.1 193 3475
## 6 Adelie Torgersen 34.4 18.4 184 3325
## sex year
## 1 female 2009
## 2 female 2008
## 3 female 2008
## 4 female 2008
## 5 <NA> 2007
## 6 female 2007
And you can include the desc() function to reverse the
sort order:
penguin %>%
arrange(desc(bill_length_mm)) %>%
head()
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1 Gentoo Biscoe 59.6 17.0 230 6050
## 2 Chinstrap Dream 58.0 17.8 181 3700
## 3 Gentoo Biscoe 55.9 17.0 228 5600
## 4 Chinstrap Dream 55.8 19.8 207 4000
## 5 Gentoo Biscoe 55.1 16.0 230 5850
## 6 Gentoo Biscoe 54.3 15.7 231 5650
## sex year
## 1 male 2007
## 2 female 2007
## 3 male 2009
## 4 male 2009
## 5 male 2009
## 6 male 2008
Note that you can add additional variable names to arrange() to sort by further columns, which are used to break ties in the earlier ones.
Now, we will filter for female penguins, select four columns from the dataset, arrange the rows by increasing body mass, then pass this to head() to show the first few rows:
penguin %>%
filter(sex == "female") %>%
select(island, year, body_mass_g, flipper_length_mm) %>%
arrange(body_mass_g) %>%
head()
distinct() function
The function distinct() will return the unique values of a column (or the unique combinations of several columns), so to get the list of islands in the penguin data set:
penguin %>%
distinct(island)
So, if we want the list of islands that have female penguins with over 4000 g body mass:
penguin %>%
filter(sex == "female", body_mass_g > 4000) %>%
distinct(island)
mutate() function
This function can be used to add new columns to the data frame. We’ll use this to create a new column with the body mass in kilograms:
penguin %>%
select(island, body_mass_g) %>%
mutate(body_mass_kg = body_mass_g / 1000)
And let’s sort this by weight to find the islands with the highest masses:
penguin %>%
select(island, body_mass_g) %>%
mutate(body_mass_kg = body_mass_g / 1000) %>%
arrange(desc(body_mass_kg)) %>%
head()
Note that this does not add the column to the original dataframe, unless you assign the output:
penguin <- penguin %>%
mutate(body_mass_kg = body_mass_g / 1000)
names(penguin)
We’ll look at a couple of other dplyr functions shortly, but first, we’ll explore what an R function is.
Functions typically consist of the name of the function (sqrt for taking square roots) and a set of parentheses. The parentheses are used to pass data to the function as well as setting parameters to change the behavior of the function.
sqrt(5)
## [1] 2.236068
Note that we can use the assignment operator to save the output from a function, allowing you to use this in subsequent functions and analyses.
y <- sqrt(5)
round(y)
## [1] 2
To save time and code, functions can be combined:
round(sqrt(5))
## [1] 2
Most functions take a series of arguments that control the way they work. As an example, we’ll look at the seq() function, which produces a series of numbers on a regular step. It is typically called with three arguments: the starting number, the ending number and the step.
seq(from = 0, to = 20, by = 2)
## [1] 0 2 4 6 8 10 12 14 16 18 20
If you include the argument names, as in this example, the order does not matter. The argument names can be omitted if you keep to the specified order of arguments. So seq(0,20,2) will give you the equivalent results.
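The following short sketch checks this directly by building the same sequence three ways and comparing the results with identical(). The variable names are chosen just for this example.

```r
named      <- seq(from = 0, to = 20, by = 2)
reordered  <- seq(by = 2, to = 20, from = 0)   # names given, order changed
positional <- seq(0, 20, 2)                    # no names, default order

identical(named, reordered)
identical(named, positional)
```

Naming the arguments is slightly more typing, but it makes the code easier to read and protects you from mixing up the order.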
To find out what these arguments are, what they are called and what
values they take, use the help() function,
e.g. help(seq) or just ?seq. This will open a
window with the help file for that function. If you do not know the name
of a function, there is a search function help.search(), or
use the help browser help.start(); browse to packages or
use the search engine.
R has a large number of inbuilt functions. This section is designed to simply introduce you to some basic functions for estimating simple univariate statistics. We’ll start by simply calculating the mean of the bill length values:
mean(penguin$bill_length_mm)
## [1] NA
This returns the value NA, rather than a mean length. So
what went wrong? In the original set of data, there are some missing
values, also denoted by NA.
penguin$bill_length_mm[1:15]
## [1] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42.0 37.8 37.8 41.1 38.6 34.6
R’s default for most functions is to not calculate values when there
are missing observations. This is really to alert you to the fact that
the data are incomplete, and the value you would obtain might be biased.
You can overrule this by adding the argument na.rm=TRUE to
the following functions. This removes NAs and calculates
the value with whatever is leftover.
Functions to describe the central tendency:
mean(penguin$bill_length_mm, na.rm = TRUE)
## [1] 43.92193
median(penguin$bill_length_mm, na.rm = TRUE)
## [1] 44.45
Functions to describe the dispersion (output not shown):
sd(penguin$bill_length_mm, na.rm = TRUE) ## Standard deviation
var(penguin$bill_length_mm, na.rm = TRUE) ## Variance
min(penguin$bill_length_mm, na.rm = TRUE)
max(penguin$bill_length_mm, na.rm = TRUE)
quantile(penguin$bill_length_mm, na.rm = TRUE)
Note that quantile() takes a parameter (probs) that allows you to choose the quantiles to be calculated, e.g. quantile(penguin$bill_length_mm, c(0.1,0.9), na.rm = TRUE) will calculate the 10th and 90th percentiles. Try adapting this to calculate the 25th and 75th percentiles.
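To see the effect of na.rm and the probs argument on numbers that are easy to check by hand, here is a small sketch using a made-up vector (the values 1 to 9 plus one missing value) rather than the penguin data.

```r
vals <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, NA)

mean(vals)                 # NA, because one value is missing
mean(vals, na.rm = TRUE)   # mean of the nine observed values

# Default quantiles, then the 10th and 90th percentiles
quantile(vals, na.rm = TRUE)
q <- quantile(vals, probs = c(0.1, 0.9), na.rm = TRUE)
q
```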
Some other useful functions:
sum(penguin$bill_length_mm, na.rm = TRUE)
## [1] 15021.3
table(penguin$species)
##
## Adelie Chinstrap Gentoo
## 152 68 124
summary(penguin$bill_length_mm)
As R is object oriented, functions will adapt to different data types. Compare:
summary(penguin$bill_length_mm) ## Summary of numeric vector
summary(penguin$species) ## Summary of character vector
summary(penguin) ## Summary of data frame
summarize() function
These functions can be used with the dplyr function summarize() (also spelled summarise(), as in the code below) to create summary statistics for a given column in the data frame, for example, finding the mean. To compute the average body mass, use the mean() function with the new column body_mass_kg.
penguin %>%
summarise(avg_body_mass = mean(body_mass_kg, na.rm = TRUE))
## avg_body_mass
## 1 4.201754
You can use most of the functions that calculate summary statistics described above. There are a number of others that are useful here, including n() to get the number of rows (observations):
penguin %>%
summarise(count = n())
## count
## 1 344
And n_distinct() returns the number of distinct values in a vector.
penguin %>%
summarise(count_islands = n_distinct(island),
count_species = n_distinct(species))
## count_islands count_species
## 1 3 3
We can then easily calculate a range of summary statistics in a single call as follows:
penguin %>%
summarise(avg_body_mass = mean(body_mass_kg, na.rm = TRUE),
sd_body_mass = sd(body_mass_kg, na.rm = TRUE),
min_body_mass = min(body_mass_kg, na.rm = TRUE),
max_body_mass = max(body_mass_kg, na.rm = TRUE))
## avg_body_mass sd_body_mass min_body_mass max_body_mass
## 1 4.201754 0.8019545 2.7 6.3
Although this may not seem that useful (relative to using the
functions on their own), combining this with a second function
(group_by) really expands this use.
group_by() function
The group_by() function is a very useful addition to these other functions. It is related to the concept of “split-apply-combine”: for many analyses, we literally want to split the data frame by some variable (e.g. island or year), apply a function to the individual data frames and then combine the output. Let’s do that: split the penguin data frame by species, calculate summary statistics (as above), then return everything in a new data frame, giving a set of summary statistics for each species.
penguin %>%
group_by(species) %>%
summarise(avg_body_mass = mean(body_mass_kg, na.rm = TRUE),
sd_body_mass = sd(body_mass_kg, na.rm = TRUE),
min_body_mass = min(body_mass_kg, na.rm = TRUE),
max_body_mass = max(body_mass_kg, na.rm = TRUE))
## # A tibble: 3 × 5
## species avg_body_mass sd_body_mass min_body_mass max_body_mass
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Adelie 3.70 0.459 2.85 4.78
## 2 Chinstrap 3.73 0.384 2.7 4.8
## 3 Gentoo 5.08 0.504 3.95 6.3
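To make the three steps of “split-apply-combine” visible, the sketch below carries them out one at a time in base R on a small made-up data frame (toy). This is only an illustration of the idea, not a description of how dplyr works internally.

```r
# Made-up values for illustration only
toy <- data.frame(species = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
                  body_mass_kg = c(3.5, 3.9, 5.0, 5.4))

# Split: one vector of body masses per species
pieces <- split(toy$body_mass_kg, toy$species)

# Apply: compute the mean of each piece
means <- sapply(pieces, mean)

# Combine: put the results back into a single data frame
by_species <- data.frame(species = names(means),
                         avg_body_mass = unname(means))
by_species
```

The group_by() and summarise() pair does all three steps in one pipeline, which is why it is usually the more convenient choice.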
We can also group by two variables. Here, we’ll just calculate the mean body mass by species and island, and arrange by the mean:
penguin %>%
group_by(species, island) %>%
summarise(avg_body_mass = mean(body_mass_kg, na.rm = TRUE)) %>%
arrange(avg_body_mass)
## `summarise()` has grouped output by 'species'. You can override using the
## `.groups` argument.
We’ll now look quickly at the basic plotting functions in R. R has a
wide range of plotting types, and we will look at some more complex
methods later in this class. For now, we will concentrate on the basic
plotting function (plot()) and how to simply modify
this.
The basic R function for plotting (plot()) will produce
a scatter plot of two variables:
plot(penguin$bill_length_mm, penguin$bill_depth_mm)
As we know that these values come from three different species, we can use this knowledge to add extra information to the plot, by using the col parameter. We first convert the species column to a factor class. R will then use the levels of the factor to assign colors to the points. We can also change the symbol type using the pch parameter.
penguin$species <- as.factor(penguin$species)
plot(penguin$bill_length_mm, penguin$bill_depth_mm,
col = penguin$species, pch = 16)
Let’s clean up this plot a little by specifying the axis labels and a title:
plot(penguin$bill_length_mm, penguin$bill_depth_mm,
col = penguin$species,
pch = 16,
xlab = "Bill length (mm)",
ylab = "Bill depth (mm)",
main = "Penguin size measurements (Palmer Archipelago)")
We can also add a legend to our plot to explain the different colors and symbols. Unfortunately, R makes you do all the work for this, using the legend() function. Here, we add a legend to the bottom left of the plot, giving the labels for each color and the color used:
legend("bottomleft",
legend = c("Adelie","Chinstrap","Gentoo"),
col = c(1,2,3),
pch = 16)
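The legend colors 1, 2 and 3 work because each level of a factor has an integer code, and plot() uses those codes as color numbers. The sketch below, using a few made-up values, shows the codes and builds the legend from levels() so that the labels and colors cannot get out of step. It draws to a null pdf device so that it also runs where no screen is available.

```r
# Made-up values for illustration only
species <- factor(c("Adelie", "Gentoo", "Chinstrap", "Adelie"))
bill_length <- c(39.1, 46.1, 49.0, 38.2)
bill_depth  <- c(18.7, 13.2, 18.5, 18.1)

# Each factor level has an integer code; plot() uses these codes as color numbers
codes <- as.integer(species)
levels(species)
codes

pdf(NULL)   # draw to a null device so the sketch also runs without a screen
plot(bill_length, bill_depth, col = species, pch = 16)
legend("bottomleft",
       legend = levels(species),             # labels in level order
       col = seq_along(levels(species)),     # matching color numbers 1, 2, 3
       pch = 16)
dev.off()
```

To see the plot on screen, leave out the pdf(NULL) and dev.off() lines.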
Histograms are commonly used to visualize the distribution of a set of values. These are ‘binned’ into a set of classes, and the histogram represents the frequency of occurrences in that bin.
hist(penguin$bill_length_mm)
Bins can be defined with the breaks parameter, which may be set to a single number, in which case the data range is split into approximately that many bins, or to a sequence of numbers defining the boundaries between bins. In this latter case, we can make use of the seq() function from earlier.
hist(penguin$bill_length_mm, breaks = 20)
hist(penguin$bill_length_mm, breaks = seq(30, 60, 2.5))
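To check how values are assigned to bins, you can ask hist() to return the bin counts without drawing anything. This sketch uses six made-up values and bin boundaries at 0, 2.5, 5, 7.5 and 10.

```r
vals <- c(1, 2, 3, 6, 7, 9)

# plot = FALSE returns the histogram object instead of drawing it
h <- hist(vals, breaks = seq(0, 10, 2.5), plot = FALSE)
h$breaks   # the bin boundaries
h$counts   # the number of values falling in each bin
```

Every value must fall inside the range covered by the breaks, otherwise hist() will stop with an error.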
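An alternative to histograms is the boxplot, which shows information about the data quartiles. Here the box represents the interquartile range (the middle 50% of the data, from the 25th to the 75th percentile), the thick bar is the median, and the ‘whiskers’ extend to the most extreme values within 1.5 times the interquartile range of the box, with any points beyond this drawn individually.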
boxplot(penguin$bill_length_mm)
More usefully, we can look at boxplots across a set of classes.
boxplot(penguin$bill_length_mm ~ penguin$species,
ylab = 'Bill length (mm)')
Note that this code uses a tilde (~) between the variable and the set of factors. The tilde is often used to define dependency between two variables, and we will return to this again during the modeling part of this class.
By default, R plots graphics to the screen, but has the ability to
save figures in most of the standard graphic formats. In order to do
this, you first need to open a file (a graphics device), then
run the plotting functions, then close the device. Remember that you
need to plot all the layers of a figure before closing the file. The
following example plots the penguin bill length data to a pdf file.
Alternatives include: png, jpeg, svg, etc.; type
help(Devices) for more details.
pdf("penguin_boxplot.pdf")
boxplot(penguin$bill_length_mm ~ penguin$species,
ylab = 'Bill length (mm)')
dev.off()
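The open, plot, close sequence can be tried safely with a temporary file. The sketch below uses made-up values, adds a second layer before closing the device, and then checks that the file was created; the name out_file and the figure size are choices made for this example.

```r
out_file <- tempfile(fileext = ".pdf")

pdf(out_file, width = 6, height = 4)    # open the device (size in inches)
boxplot(c(39.1, 46.1, 49.0, 38.2, 50.0), ylab = "Bill length (mm)")
abline(h = 45, lty = 2)                 # a second layer, added before closing
dev.off()                               # close the device to finish writing the file

file.exists(out_file)
```

If you forget dev.off(), later plots will keep being sent to the file rather than to the screen.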
Alternatively, you can copy-paste directly into Word by going to [Export] -> [Copy to clipboard…] in RStudio’s plotting window.
The ggplot2 package forms part of the tidyverse set
of packages, and implements the Grammar
of Graphics framework developed by Leland Wilkinson. The idea behind
the grammar of graphics is that all plots can be described by a common
language, rather than considering them as separate barplots, line
charts, etc. What differs is the coordinate system and geometry
used to place the data on the page. Using this package requires a little
more work than standard plots, but the results are usually worthwhile.
If you installed the tidyverse set of packages earlier, you will already
have this on your system. If not, install it with
install.packages("ggplot2"). Then load the library:
library(ggplot2)
In order to understand how ggplot makes a figure, we
need to establish what the fundamental parts are of every data graph.
They are:
When making a ggplot figure, we generally start by
creating the base figure. To do this we need to tell the function where
the data is coming from, and the base aesthetic (i.e. which variable is
x, which is y?). To remake the first scatter
plot with ggplot(), we start by doing the following:
ggplot(penguin, aes(x = bill_length_mm, y=bill_depth_mm))
This creates the base plot with correct axes, but as you’ll see, there
is no data on it. This is because we have not specified the geometry -
the way in which we want the data to be plotted.
ggplot2 works as a series of layers, so we can add
(+) the geometry to this base as follows:
ggplot(penguin, aes(x = bill_length_mm, y=bill_depth_mm)) +
geom_point()
## Warning: Removed 2 rows containing missing values (`geom_point()`).
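Because a plot is built from layers, it can also be stored in a variable and extended later. The following is a minimal sketch of that idea, using a hypothetical penguin_toy data frame with made-up values. The names p and p_points are just illustrative.

```r
library(ggplot2)

# Hypothetical stand-in for the lab's `penguin` data frame (made-up values)
penguin_toy <- data.frame(
  bill_length_mm = c(38.1, 39.5, 40.3, 48.7, 50.0, 46.5, 47.2, 45.1),
  bill_depth_mm  = c(18.7, 17.4, 18.0, 18.4, 19.5, 17.9, 14.3, 14.5)
)

# Base plot: data and aesthetics only, no geometry yet
p <- ggplot(penguin_toy, aes(x = bill_length_mm, y = bill_depth_mm))
length(p$layers)         # number of layers in the base plot

# Adding a geometry returns a new plot object with one more layer
p_points <- p + geom_point()
length(p_points$layers)

# Typing the name of the object draws the plot
p_points
```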
We can specify symbols and colors for the points in the
geom_ function:
ggplot(penguin, aes(x = bill_length_mm, y=bill_depth_mm)) +
geom_point(color = 'darkorange', shape = 4, size = 4)
## Warning: Removed 2 rows containing missing values (`geom_point()`).
More usefully, we can use any of the categorical variables to set the color. To do this, we need to use them when creating the aesthetic, so that different colors will be used for each class. This also adds a legend directly.
ggplot(penguin, aes(x = bill_length_mm,
y = bill_depth_mm,
col = species)) +
geom_point()
## Warning: Removed 2 rows containing missing values (`geom_point()`).
You can also add a shape aesthetic using another categorical variable
(here we’ll look at the different sexes):
ggplot(penguin, aes(x = bill_length_mm,
y = bill_depth_mm,
col = species,
shape = sex)) +
geom_point()
## Warning: Removed 11 rows containing missing values (`geom_point()`).
Note that there are some missing values in the sex
column. This allows us to demonstrate another neat aspect of
ggplot2 - it can be combined with
dplyr for preprocessing. Here, we’ll use the
filter() function and is.na() to remove the
missing values and pipe the output directly to ggplot:
penguin %>%
filter(!is.na(sex)) %>%
ggplot(aes(x = bill_length_mm,
y = bill_depth_mm,
col = species,
shape = sex)) +
geom_point()
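To see what the filter step is doing, it can help to count the rows before and after. This is only a sketch on a hypothetical penguin_toy data frame with made-up values; with the lab data, the same calls can be run on penguin.

```r
library(dplyr)

# Hypothetical stand-in for the lab's `penguin` data frame (made-up values)
penguin_toy <- data.frame(
  species = c("Adelie", "Adelie", "Chinstrap", "Chinstrap", "Gentoo", "Gentoo"),
  sex     = c("male", NA, "female", "male", NA, "female")
)

# How many rows have a missing value for sex?
n_missing <- sum(is.na(penguin_toy$sex))

# Keep only the rows where sex is NOT missing
penguin_complete <- penguin_toy %>%
  filter(!is.na(sex))

# The difference in row counts matches the number of missing values
nrow(penguin_toy)
nrow(penguin_complete)
n_missing
```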
We can next add scales to set the x and y labels. These functions can do
quite a bit more, e.g. logarithmic scaling or scaling colors. We’ll also
set a theme to remove the default gray background (there are a few of
these, and several more that can be added with the
ggthemes package).
penguin %>%
filter(!is.na(sex)) %>%
ggplot(aes(x = bill_length_mm,
y = bill_depth_mm,
col = species,
shape = sex)) +
geom_point() +
scale_x_continuous("Bill Length (mm)") +
scale_y_continuous("Bill Depth (mm)") +
theme_bw()
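If you only want to change the axis and legend titles, ggplot2 also provides the labs() function, which some people find easier to read than a separate scale for each axis. The example below is a sketch of this alternative, not a replacement for the code above. It uses a hypothetical penguin_toy data frame with made-up values.

```r
library(ggplot2)

# Hypothetical stand-in for the lab's `penguin` data frame (made-up values)
penguin_toy <- data.frame(
  species = rep(c("Adelie", "Chinstrap", "Gentoo"), each = 3),
  bill_length_mm = c(38.1, 39.5, 40.3, 48.7, 50.0, 46.5, 47.2, 45.1, 49.9),
  bill_depth_mm  = c(18.7, 17.4, 18.0, 18.4, 19.5, 17.9, 14.3, 14.5, 15.1)
)

p_labelled <- ggplot(penguin_toy, aes(x = bill_length_mm,
                                      y = bill_depth_mm,
                                      col = species)) +
  geom_point() +
  # labs() sets titles by aesthetic name; `colour` is the legend title
  labs(x = "Bill length (mm)",
       y = "Bill depth (mm)",
       colour = "Species") +
  theme_bw()

p_labelled
```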
We’ll just make one more adjustment to this figure. We’ll add a
facet layer. This splits the original plot into a set of
small multiples, based on another categorical variable (here we’ll use
the year).
penguin %>%
filter(!is.na(sex)) %>%
ggplot(aes(x = bill_length_mm,
y = bill_depth_mm,
col = species,
shape = sex)) +
geom_point() +
scale_x_continuous("Bill Length (mm)") +
scale_y_continuous("Bill Depth (mm)") +
theme_bw() +
facet_wrap(~year)
ggplot(penguin, aes(x = bill_length_mm)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (`stat_bin()`).
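The message above suggests choosing the bin width yourself rather than relying on the default of 30 bins. As a sketch, the example below uses bins that are 1 mm wide. That value is an arbitrary choice for illustration, and penguin_toy is a hypothetical data frame with made-up values.

```r
library(ggplot2)

# Hypothetical stand-in for the lab's `penguin` data frame (made-up values)
penguin_toy <- data.frame(
  bill_length_mm = c(36.7, 38.1, 39.5, 40.3, 45.1, 46.5,
                     46.8, 47.2, 48.7, 49.9, 50.0, 51.3)
)

# binwidth is given in the units of the x variable (here, millimetres)
p_hist <- ggplot(penguin_toy, aes(x = bill_length_mm)) +
  geom_histogram(binwidth = 1)

p_hist

# layer_data() returns the bins ggplot2 computed, one row per bin
bins <- layer_data(p_hist)
head(bins[, c("xmin", "xmax", "count")])
```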
ggplot(penguin, aes(x = bill_length_mm, fill = species)) +
geom_histogram(position = 'identity', alpha = 0.7)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (`stat_bin()`).
ggplot(penguin, aes(x = species, y = bill_length_mm)) +
geom_boxplot()
## Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).
ggplot(penguin, aes(x = species, y = bill_length_mm, fill = sex)) +
geom_boxplot()
## Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).
When you are finished with R, exit by typing q() in the
console (or going to [File] ->
[Quit R Studio]). You will be asked if you want to save your workspace.
This is generally a good idea, as this will create a file containing all
your current data (“.RData”), and the history (“.Rhistory”) of the
commands you have used. If you restart R in the same directory, by
clicking on an R script file, the workspace will be loaded
automatically. If it doesn’t, you can load it by changing to the
correct working directory and typing:
load(".RData")
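You do not have to save the whole workspace. The save() function writes only the objects you name, and load() reads them back in under the same names. The following is a small sketch of this. The object bill_lengths and the file name lab_objects.RData are made up for the example, and the file is written to a temporary folder.

```r
# An illustrative object to save (made-up values)
bill_lengths <- c(39.1, 39.5, 40.3)

# Save just this object to a file in a temporary folder
rdata_file <- file.path(tempdir(), "lab_objects.RData")
save(bill_lengths, file = rdata_file)

# Remove it from the workspace, then restore it from the file
rm(bill_lengths)
exists("bill_lengths")

load(rdata_file)
exists("bill_lengths")
bill_lengths
```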
Working in R can be frustrating, with errors and warnings popping up from you know not where. In fact, problems arise so often that troubleshooting (which usually just means Googling) should be considered an inescapable component of programming.
Here are some of the best places to look for help: